Transfer Topic Modeling with Ease and Scalability
Abstract

The increasing volume of short texts generated on so- 
cial media sites, such as Twitter or Facebook, creates 
a great demand for effective and efficient topic modeling
approaches. While latent Dirichlet allocation
(LDA) can be applied, it is not optimal due to its 
weakness in handling short texts with fast-changing 
topics and scalability concerns. In this paper, we pro- 
pose a transfer learning approach that utilizes abun- 
dant labeled documents from other domains (such as 
Yahoo! News or Wikipedia) to improve topic modeling,
with better model fitting and result interpretation.
Specifically, we develop the Transfer Hierarchical
LDA (thLDA) model, which incorporates the label
information from other domains via informative pri- 
ors. In addition, we develop a parallel implementa- 
tion of our model for large-scale applications. We 
demonstrate the effectiveness of our thLDA model 
on both a microblogging dataset and standard text 
collections including AP and RCV1 datasets. 



1 Introduction 

Social-media websites, such as Twitter and Facebook, 
have become a novel real-time channel for people to 
share information on a broad range of subjects. With 
millions of messages and updates posted daily, there 
is an obvious need for effective ways to organize the 
data tsunami. Latent Dirichlet Allocation (LDA), as
a Bayesian hierarchical model to capture the text
generation process [5], has been shown to be very
powerful in modeling text corpora. However, several major
characteristics distinguish social media data from
traditional text corpora and pose great challenges to the
LDA model. First, each post or tweet is limited to a
certain number of characters, and as a result abbreviated
syntax is often introduced. Second, the texts
are very noisy, with broad topics and repetitive, less
meaningful content. Third, the input data typically
arrives in high-volume streams.

It is known that the topics output by LDA depend
completely on the word distributions in the training
documents. Therefore, LDA results on blog and
microblogging data would naturally be poor, with
clusters of words that co-occur in many documents
without actual semantic meaning jTU [21]. Intuitively,
the generative process of the LDA model can be
guided by the document labels so that the learned
hidden topics are more meaningful. Therefore, in
[U [T21 [H], discriminative training of LDA has been
explored and, in particular, [SO] applied labeled LDA
(lLDA) to analyze Twitter data by using hashtags
as labels. This approach addresses our challenges
to a certain extent. Even though supervised LDA
gives more comprehensible topics than LDA, it shares
the general limitation of the LDA framework in that
each document is represented by a single topic
distribution. This makes comparing documents difficult
when we have sparse, noisy documents with constantly
changing topics. Furthermore, it is very difficult
to obtain labels (e.g. hashtags) for continuously
growing text collections like social media,
not to mention the fact that on Twitter
many hashtags refer to very broad topics (e.g.
"#internet", "#sales") and therefore could even be
misleading when used to guide the topic models. [2]
proposed the hierarchical LDA model (hLDA), which
generates topic hierarchies from an infinite number of
topics represented by a topic tree, where
each document is assigned a path from the tree root to
one of the tree leaves. hLDA has the capability to
encode semantic topic hierarchies, but when fed
with noisy and sparse data such as user-generated
short tweet messages, it is not robust and its results
lack meaningful interpretations [TSJ [53].

Recently, hLDA has been studied in [3] to sum- 
marize multiple documents. They built a two-level 
learning model using hLDA model to discover simi- 
lar sentences to the given summary sentences using 
hidden topic distributions of the document. The dis- 
tance dependent CRP jS] studied several types of de- 
cay: window decay, exponential decay, and logistic 
decay, where customers' table assignments depend on 
external distances between them. Our paper aims to 
bridge the gap between short, noisy texts and their 
actual generation process without additional labeling 
efforts. At the same time, we develop parallel algo- 
rithms to speed up inference so that our model can 
be applicable to large-scale applications. 

In this paper, we propose a simple solution, the
transfer hierarchical LDA (thLDA) model. The basic
idea is to extract human knowledge on topics from a
source domain corpus in the form of representative
words that are consistently meaningful across contexts
or media, and to encode them as priors for learning the
topic hierarchies of the hLDA model in the target
domain corpus. To extract source domain knowledge, the
thLDA model utilizes related labeled documents from other
sources (such as Yahoo! News or socially tagged web
pages), and to encode the labels, we modify the nested
Chinese Restaurant Process (nCRP) so that it serves as
guidance for inferring latent topics of target domains.
We base our model on hierarchical LDA (hLDA) [2] mainly
because hLDA has the natural capability to encode
semantic topic hierarchies with document clusters.
In addition, a recent study suggests that the
hierarchical Dirichlet process provides an effective
explanation model for human transfer learning [5].

The rest of the paper is organized as follows: In
Section 2, we describe the proposed methodology in
detail and discuss its relationship with existing work.
In Section 3, we demonstrate the effectiveness of our
model, and in Section 4 we summarize the work and
discuss future directions.

2 Related Work 
2.1 Topic Models 

Latent Dirichlet Allocation (LDA) [5] has gained
popularity for automatically extracting a representation
of a corpus. LDA is a completely unsupervised model
that views each document as a mixture of probabilistic
topics, represented as a K-dimensional random
variable $\theta$. In the generative story, each document
is generated by first picking a topic distribution $\theta$
from the Dirichlet prior and then using the document's
topic distribution $\theta$ to sample latent topic variables
$z_i$. LDA makes the assumption that each word is
generated from one topic, where $z_i$ is a latent variable
indicating the hidden topic assignment for word $w_i$.
The probability of choosing a word $w_i$ under topic $z_i$,
$p(w_i \mid z_i, \beta)$, depends on different documents. LDA is
not appropriate for labeled corpora, so it has been
extended in several ways to incorporate a supervised
label set into its learning process. In [21], Ramage et
al. introduced Labeled LDA, a novel model that uses
multi-labeled corpora to address the credit assignment
problem. Unlike LDA, Labeled LDA constrains
the topics of a document to a given label set. Instead of
using a symmetric Dirichlet distribution with a single
hyper-parameter $\alpha$ as a Dirichlet prior on the topic
distribution $\theta^{(d)}$, it restricts $\theta^{(d)}$ to only the topics
that correspond to the observed labels $\Lambda^{(d)}$. In [3l [2],
the authors proposed a stochastic process in which
Bayesian inference is no longer restricted to
finite dimensional spaces. Unlike LDA, hLDA does not
fix the number of topics in advance and allows arbitrary
breadth and depth of topic hierarchies. The topics in
the hLDA model are represented by a topic tree, and each
document is assigned a path from the tree root to one of
the tree leaves. Each document is generated by first
sampling a path along the topic tree, and then
sampling a topic $z_i$ among all the topic nodes in the path
for each word $w_i$. The authors proposed a nested
Chinese restaurant process prior on path selection.
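$$p(\text{occupied table } i \mid \text{previous customers}) = \frac{n_i}{\gamma + n} \quad (1)$$

$$p(\text{unoccupied table} \mid \text{previous customers}) = \frac{\gamma}{\gamma + n} \quad (2)$$

where $n_i$ is the number of customers already seated at
table $i$ and $\gamma$ controls how often a new table is chosen.
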

The nested Chinese restaurant process implies that
the first customer sits at the first table and the $n$th
customer sits at a table $i$ drawn from the
above equations. When a path of depth $d$ is sampled,
there are $d$ topic nodes along the
path, and the document samples the topic $z_i$ among
all topic nodes in the path for each word $w_i$ based on
the GEM distribution. The experimental results of hLDA
show that the above document generating story can
actually encode semantic topic hierarchies.

2.2 Transfer Learning 

Transfer learning has been extensively studied in the 
past decade to leverage the data (either labeled or un- 
labeled) from one task (or domain) to help another 
[17] . In summary, there are two types of transfer 
learning problems, i.e. shared label space or shared 
feature space. For shared label space [I], the main 
objective is to transfer the label information between 
observations from different distributions (i.e. domain 
adaptation) or uncover the relations between multi- 
ple labels for better prediction (i.e. multi-task learn- 
ing); for shared feature space, one of the most rep- 
resentative works is self-taught learning [T5], which 
uses sparse coding [13] to construct higher-level fea- 
tures via abundant unlabeled data to help improve 
the performance of the classification task with a lim- 
ited number of labeled examples. 

As an unsupervised generative model, LDA pos- 
sesses the advantage of modeling the generative pro- 
cedure of the whole dataset by establishing the rela- 
tionships between the documents and their associated 
hidden topics, and between the hidden topics and 
the concrete words, in an unsupervised way. Intuitively,
transfer learning on this generative model can
be realized in two ways: one is utilizing the document
labels from the other domain (with the assumption 
that the target domain and source domain share the 
same label space) so that the learned hidden topics 
can be much more meaningful, and the other is utiliz- 
ing the documents from other domains to enrich the 
contents so that we can learn a more robust shared 
latent space. 

In [U [T3] [5T] , the authors proposed the discrimina- 
tive LDA, which adds supervised information to the 
original LDA model, and guides the generative pro- 
cess of documents by using the labels. These methods 
clearly demonstrate the advantages of the discrimina- 
tive training of generative models. However, this is 
different from transfer learning since they simply uti- 
lize the labeled documents in the same domain to help 
build the generative model. But the same motivation 
can be applied to transfer learning, in which the su- 
pervised information is used to guide the generation 
of the common latent semantic space shared by both 
the source domain and the target domain. Trans- 
ferring information from source to target domain is 
extremely desirable in social media analysis, in which 
the target domain example features are very sparse, 
with a lot of missing features. Based on the shared 
common latent semantic space, the missing features 
can be recovered to some extent, which is helpful in 
better representing these examples. 

3 Transfer Topic Models 

Content analysis on social media data is a challenging
problem due to its unique language characteristics,
which prevent standard text mining tools from
being used to their full potential. Several models
have been developed to overcome this barrier by
aggregating many messages [TT], applying temporal
attributes [22 US], examining entities [H] or
applying manual annotation to guide the topic
generation [20]. The main motivation of our work is that
previous unsupervised approaches for analyzing social
media data fail to achieve high accuracy due to
noise and sparseness, while supervised approaches
require annotated data, which often requires a
significant amount of human effort. Even though
hierarchical LDA has the natural capability to encode
semantic topic hierarchies with clusters of similar
documents based on those hierarchies, it still cannot
provide robust results for noisy and sparse data
(Fig 2(a)) because of the exchangeability assumption in
the Dirichlet Process mixture [B]. Exchangeability is
essential for the CRP mixture to be equivalent to a DP
mixture; thus customers are exchangeable. However,
this assumption is not reasonable when applied
to a microblogging dataset. Based on our experiments,
when the data set is noisy and sparse, unrelated
words tend to cluster with other words because of
co-occurrences (Fig 2(a)).



3.1 Knowledge Extraction from Source Domain



Consider the task of modeling short and sparse
documents based on a specific source domain structure.
For example, a user who is interested in browsing
documents with particular categories of topics might
prefer to see clusters of other documents organized by
those categories. By transferring the topic hierarchy
of the source domain categories when clustering the
target domain, we can produce better document clusters
and topic hierarchies by leveraging the context
information from the source domain.

Figure 1: (a) Yahoo! News home page and the categories
of news (b) The graphical model representation
of transfer Hierarchical LDA with a modified nested
CRP prior

User-generated categories can be found in various
source domains: Twitter lists, Flickr collections
and sets, the Del.icio.us hierarchy and Wikipedia or
news categories. We transfer source domain knowledge to
target domain documents by assigning a prior path,
i.e. a sequence of assigned topic nodes from the root to
a leaf node. The prior paths of the documents can be used
to identify whether two target documents are similar
or not, so that our model can group similar documents
together while keeping different documents
separate based on the label. To label each document's
prior path on the source domain hierarchy, we generate
word vectors for the nodes in the source domain
hierarchy. Each label is generated by measuring the
similarity between the source domain topic hierarchy and
the document. There are many ways to measure similarity
between two vectors: cosine similarity, Euclidean
distance, Jaccard index and so on. For simplicity,
in this paper, we label our prior knowledge of the
target document by computing the cosine similarity
between the target document and a node in the source
domain hierarchy. We start from the root node of
the hierarchy and keep assigning only the most similar
topic node at each level, considering only the
child nodes of the currently assigned topic node as
next-level candidates.
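
The greedy top-down labelling described above can be
sketched as follows. This is a minimal illustration, not
the authors' implementation: the dictionary layout of the
hierarchy, the names `cosine` and `prior_path`, and the
toy word weights are all assumptions made for the example.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse word-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def prior_path(doc_tokens, root):
    """Greedy walk from the root: keep only the most similar child per level."""
    doc_vec = Counter(doc_tokens)
    path, node = [], root
    while node.get("children"):
        node = max(node["children"],
                   key=lambda c: cosine(doc_vec, c["vector"]))
        path.append(node["name"])
    return path

# Hypothetical two-level source hierarchy with hand-made word weights.
hierarchy = {
    "name": "root",
    "children": [
        {"name": "sports", "vector": {"game": 1.0, "team": 0.8},
         "children": [
             {"name": "soccer", "vector": {"goal": 1.0, "team": 0.5},
              "children": []},
             {"name": "tennis", "vector": {"serve": 1.0, "set": 0.7},
              "children": []},
         ]},
        {"name": "technology", "vector": {"phone": 1.0, "app": 0.9},
         "children": []},
    ],
}

print(prior_path(["team", "goal", "game"], hierarchy))  # ['sports', 'soccer']
```

Note that when a document shares no words with any child,
all similarities are zero and this sketch simply keeps the
first child; a real system would probably leave such a
document without a prior.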



3.2 Transfer Hierarchical LDA 

We incorporate the label hierarchies into the model
by changing the prior on the path in hLDA so
that path selection favors the paths in the existing
label hierarchies. Similar to the original hLDA, thLDA
models each document as a mixture of the topics on its
path, and generates each word from one of these topics.
In the original hLDA model, the path prior is the nested
Chinese Restaurant Process (nCRP), where the
probability of choosing a topic in a given layer
depends on the number of documents already assigned
to each node in that layer. That is, nodes assigned
more documents have a higher probability
of generating new documents. In contrast, thLDA
incorporates supervision by modifying the path prior
using the following equations:

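$$p(\text{table } i \text{ without prior} \mid \text{previous customers}) = \frac{n_i}{\gamma + k\lambda + n} \quad (3)$$

$$p(\text{table } i \text{ with prior} \mid \text{previous customers}) = \frac{n_i + \lambda}{\gamma + k\lambda + n} \quad (4)$$

$$p(\text{unoccupied table} \mid \text{previous customers}) = \frac{\gamma}{\gamma + k\lambda + n} \quad (5)$$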



where $k$ is the total number of labels for the
document, $d$ is the total number of documents, and $\lambda$
is the weighted indicator variable that controls the
strength of the prior added to the original nested
Chinese restaurant process. The graphical model of the
new model is shown in Fig 1(b). Whether the customer
sits at a specific table is related not only to
how many customers are currently sitting at table $i$
($n_i$) and how often a new customer chooses a new
table ($\gamma$), but also to how close the current
customer is to the customers at table $i$ ($\lambda$). Note that
in this work we use the same $\lambda$ for different topics for
simplicity; however, a table-specific $\lambda$ or a different
prior for each table can be applied for a more
sophisticated model.
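
As a small numerical illustration of the modified prior,
the sketch below computes the seating probabilities for
one restaurant. It rests on two assumptions that the text
does not state explicitly: $n$ is taken to be the number of
customers already seated, and $k$ is taken to be the number
of tables that carry a prior for the arriving document.
The function name `modified_crp_probs` is hypothetical.

```python
def modified_crp_probs(counts, has_prior, gamma, lam):
    """Seating probabilities for each existing table plus one new table.

    counts[i]    : customers (documents) already at table i, i.e. n_i
    has_prior[i] : True if the source-domain label points at table i
    gamma        : weight for opening a new table
    lam          : strength of the transferred prior (lambda)
    """
    n = sum(counts)                  # assumed: customers seated so far
    k = sum(has_prior)               # assumed: tables that carry a prior
    denom = gamma + k * lam + n
    probs = [(n_i + (lam if p else 0.0)) / denom
             for n_i, p in zip(counts, has_prior)]
    probs.append(gamma / denom)      # last entry: unoccupied table
    return probs

# Two occupied tables; only the second one matches the document's label.
probs = modified_crp_probs([3, 1], [False, True], gamma=1.0, lam=2.0)
print(probs)
```

With these numbers the labelled table, which holds one
document, becomes as likely as the unlabelled table that
holds three, which is the intended effect of $\lambda$. Setting
`lam=0` recovers the unmodified CRP weights.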

In the transfer hierarchical LDA model, the documents
in a corpus are assumed to be drawn from the
following generative process:

(1) For each table $k$ in the infinite tree

    (a) Generate $\beta^{(k)} \sim \text{Dirichlet}(\eta)$

(2) For each document $d$

    (a) Generate $c^{(d)} \sim \text{modified nCRP}(\gamma, \lambda)$

    (b) Generate $\theta^{(d)} \mid \{m, \pi\} \sim \text{GEM}(m, \pi)$

    (c) For each word $n$,

        i. Choose level $z_{d,n} \mid \theta_d \sim \text{Discrete}(\theta_d)$

        ii. Choose word $w_{d,n} \mid \{z_{d,n}, c_d, \beta\} \sim \text{Discrete}(\beta_{c_d}[z_{d,n}])$

The notation is as follows: $z_{d,n}$ is the topic
assignment of the $n$th word in the $d$th document over
$L$ topics; $w_{d,n}$ is the $n$th word in the $d$th document.
In Fig 1(b), $T$ represents a collection of an infinite
number of $L$-level paths drawn from the modified
nested CRP. $c_{m,l}$ represents the restaurant
corresponding to the $l$th topic distribution in the $m$th
document, and the distribution of $c_{m,l}$ is defined by the
modified nested CRP conditioned on all the previous
$c_{n,l}$ where $n < m$. We assume that each table in a
restaurant is assigned a parameter vector $\beta$, a
multinomial topic distribution over the vocabulary for
each topic $k$, drawn from a Dirichlet prior $\eta$. We also
assume that the words are generated from a mixture model
whose mixing proportions are a random distribution
specific to each document. A document is drawn by first
choosing an $L$-level path through the modified nested CRP
and then drawing the words from the $L$ topics
associated with the restaurants along the path.
= (l±, Ik) refers binary label presence indicators 
and is the label prior for label k. 
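
Steps (2)(b) and (2)(c) of the generative process can be
sketched for a single document whose path has already
been drawn. The sketch assumes the usual stick-breaking
construction for GEM(m, π), with stick proportions drawn
from Beta(mπ, (1−m)π), and truncates it at the path depth
so that the last level absorbs the remaining mass. The
names `sample_gem` and `generate_document` and the toy
topics are illustrative only.

```python
import random

def sample_gem(m, pi, depth, rng):
    """Truncated stick-breaking draw from GEM(m, pi) over `depth` levels."""
    theta, remaining = [], 1.0
    for _ in range(depth - 1):
        v = rng.betavariate(m * pi, (1.0 - m) * pi)
        theta.append(remaining * v)
        remaining *= 1.0 - v
    theta.append(remaining)          # last level takes the leftover mass
    return theta

def generate_document(path_topics, n_words, m, pi, rng):
    """path_topics[l] maps word -> probability for level l of the path."""
    theta = sample_gem(m, pi, len(path_topics), rng)
    words = []
    for _ in range(n_words):
        # Choose a level z from Discrete(theta), then a word from that topic.
        level = rng.choices(range(len(theta)), weights=theta)[0]
        topic = path_topics[level]
        word = rng.choices(list(topic), weights=list(topic.values()))[0]
        words.append((level, word))
    return theta, words

# Hypothetical three-level path: root, "sports", "soccer".
topics = [{"the": 0.5, "a": 0.5}, {"game": 1.0}, {"goal": 0.7, "team": 0.3}]
theta, words = generate_document(topics, 10, m=0.5, pi=10.0,
                                 rng=random.Random(0))
print(theta, words)
```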

thLDA is able to transfer source domain knowledge
to the topic hierarchy by making an assignment that
depends not only on how many documents are assigned
to topic $i$ ($n_i$) but also on how close the current topic
is to the documents in topic $i$ based on the source
domain knowledge ($\lambda$). For topics unseen in the source
domain, we have no knowledge to transfer,
so we assign a probability proportional
to the number of documents already assigned to topic
$i$. Unlike approaches that transfer knowledge only for
labeled entities or given source domain knowledge, our
model learns from both unlabeled and labeled data based
on the different prior probabilities in equations (3),
(4) and (5).

A modified nested Chinese restaurant process can
be imagined through the following scenario. As in
[3J [2], suppose that there are an infinite number of
infinite-table Chinese restaurants in a city and there
is only one headquarters restaurant. There is a note
on each table in every restaurant which refers to
another restaurant, and each restaurant is referred to
exactly once. So, starting from the headquarters
restaurant, all other restaurants are connected as a
branching tree. One can think of a table as a door to
another restaurant unless the current restaurant is a
leaf node restaurant of the tree. So, starting from the
root restaurant, one can reach the final destination
table of a leaf node restaurant, and the customers in
the same restaurant share the same path. When a new
customer arrives, instead of following the original nested
Chinese restaurant process, we put a higher weight on
the table where similar customers are seated.

3.3 Gibbs Sampling for Inference 

The inference procedure in our thLDA model is similar
to that of hLDA except for the modification of the path
prior. We use a Gibbs sampling scheme: for each
document in the corpus, we first sample a path $c_d$ in
the topic tree conditioned on the paths $c_{-d}$ of the
rest of the documents in the corpus:

$$p(c_d \mid w, c_{-d}, z, \eta, \gamma, \lambda) \propto p(c_d \mid c_{-d}, \gamma, \lambda)\, p(w_d \mid c, w_{-d}, z, \eta) \quad (6)$$

Here $p(c_d \mid c_{-d}, \gamma, \lambda)$ is the prior probability on
paths implied by the modified nested CRP, and
$p(w_d \mid c, w_{-d}, z, \eta)$ is the probability of the data
given a particular path.

The second step in the Gibbs sampling inference for
each document is to sample the level assignments for
each word in the document given the sampled path:

$$p(z_{d,n} \mid z_{-(d,n)}, c, w, m, \pi, \eta) \propto p(z_{d,n} \mid z_{d,-n}, m, \pi)\, p(w_{d,n} \mid z, c, w_{-(d,n)}, \eta) \quad (7)$$

The first term is the distribution over levels from
GEM, and the second term is the word emission
probability given a topic assignment for each word. The
only thing we need to change in the inference scheme
is the path prior probability in the path sampling
step.
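
In practice the path sampling step multiplies a prior by
a likelihood that can be extremely small, so it is usually
carried out in log space. The following sketch shows one
way to draw a path proportionally to prior times
likelihood. The helper `sample_path` and the two callables
it receives are hypothetical; they stand in for the
modified nCRP prior and the word likelihood of a candidate
path, which are not implemented here.

```python
import math
import random

def sample_path(candidates, log_prior, log_likelihood, rng):
    """Draw one candidate with probability proportional to prior * likelihood."""
    scores = [log_prior(c) + log_likelihood(c) for c in candidates]
    top = max(scores)
    # Subtracting the maximum keeps exp() from underflowing to all zeros.
    weights = [math.exp(s - top) for s in scores]
    return rng.choices(candidates, weights=weights)[0]

# Toy usage with made-up log scores for three candidate paths.
toy_prior = {"world/iraq": math.log(0.5), "world/chile": math.log(0.3),
             "new": math.log(0.2)}
toy_like = {"world/iraq": -1200.0, "world/chile": -1195.0, "new": -1210.0}
path = sample_path(list(toy_prior), toy_prior.get, toy_like.get,
                   random.Random(0))
print(path)
```

Only `log_prior` differs between hLDA and thLDA here, which
mirrors the remark that the path prior is the only part of
the inference scheme that changes.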

3.4 Parallel Inference Algorithm 

We developed a parallel approximate inference
algorithm for thLDA that runs on $P$ independent
processors to improve learning efficiency. In other
words, we split the data into $P$ parts and run thLDA on
each processor, performing Gibbs sampling on its part
of the data. However, the Gibbs sampler requires that
each sampling step is conditioned on the rest of the
sampling states, hence we introduce a tree merge stage
to help the $P$ Gibbs samplers share their sampling
states periodically during the $P$ independent
inference processes.



First, given the current global state of the CRP, we
sample the topic assignment for word $n$ in document
$d$ on processor $p$:

$$p(z_{d,n,p} \mid z_{-(d,n,p)}, c, w, m, \pi, \eta) \propto p(z_{d,n,p} \mid z_{d,-n,p}, m, \pi)\, p(w_{d,n,p} \mid z, c, w_{-(d,n,p)}, \eta) \quad (8)$$

Here $z_{-(d,n,p)}$ is the vector of topic allocations on
process $p$ excluding $z_{d,n,p}$, and $w_{-(d,n,p)}$ denotes the
words on process $p$ excluding $w_{d,n,p}$, the $n$th word in
document $d$. Note that on a separate processor, we need
to use the total vocabulary size and the number of words
that have been assigned to the topic in the global state
of the CRP. We merge the $P$ topic assignment count
tables into a single set of counts after each LDA Gibbs
iteration so that the global sampling state is shared
among the $P$ processes [15].
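
One common way to combine per-processor count tables is
sketched below. It assumes that every processor starts a
sweep from a copy of the same global counts, so that only
each processor's change relative to that copy is added
back. The text does not spell out the merge rule, so this
update and the name `merge_counts` should be read as an
assumption, not as the authors' procedure.

```python
def merge_counts(global_counts, local_counts_list):
    """Combine per-processor count tables into one global table.

    Keys are (topic, word) pairs. Each local table is assumed to be a
    copy of `global_counts` that one processor updated during a sweep.
    """
    merged = dict(global_counts)
    for local in local_counts_list:
        for key in set(local) | set(global_counts):
            delta = local.get(key, 0) - global_counts.get(key, 0)
            if delta:
                merged[key] = merged.get(key, 0) + delta
    return merged

# Toy state before the sweep and after it on two processors.
before = {("t0", "apple"): 5, ("t1", "vote"): 2}
proc_a = {("t0", "apple"): 6, ("t1", "vote"): 2}
proc_b = {("t0", "apple"): 5, ("t1", "vote"): 1, ("t1", "apple"): 1}
print(merge_counts(before, [proc_a, proc_b]))
```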

Second, on the $P$ separate processors, we sample the
path selection from the conditional distribution for
$c_{d,p}$ given $w$ and the $c$ variables for all documents
other than $d$:

$$p(c_{d,p} \mid w, c_{-d,p}, z, \eta, \gamma, \lambda) \propto p(c_{d,p} \mid c_{-d,p}, \gamma, \lambda)\, p(w_{d,p} \mid c, w_{-d,p}, z, \eta) \quad (9)$$

Both the prior and the likelihood of the data given a
particular choice of $c_{d,p}$ are computed locally. Note
that computing the second term requires the global state
of the CRP: the documents' path assignments and the
number of instances of a word that have been assigned to
the topic indexed by $c_d$ on the tree.

To merge the topic trees into the global state of the
CRP, we first define the similarity between two topics
as the cosine similarity of the two topics' word
distribution vectors:

$$\text{similarity}(\beta_i, \beta_j) = \frac{\beta_i \cdot \beta_j}{\|\beta_i\| \, \|\beta_j\|} \quad (10)$$

Given $P$ infinite trees of Chinese restaurants,
we pick one tree as the base tree and recursively
merge the topics in the remaining $P-1$ trees into
the base tree in a top-down manner. For each topic
node $t_i$ being merged, we find the most similar node
$t_j$ in the base tree such that $t_i$ and $t_j$ have the
same parent node.

3.5 Discussion

foreach iteration do

    parallel foreach document d do
        foreach word n do
            sample the topic assignment
            $p(z_{d,n,p} \mid z_{-(d,n,p)}, c, w, m, \pi, \eta)$
        end
    end

    foreach process p do
        merge topic assignment count table to a
        single set of counts
    end

    parallel foreach document d do
        sample path $p(c_{d,p} \mid w, c_{-d,p}, z, \eta, \gamma, \lambda)$
        using global state of the CRP
    end

    Pick one tree q as a base tree in every merge stage
    foreach tree from process $p \in P - \{q\}$ do
        foreach depth do
            foreach topic node i do
                find most similar topic node $j \in q$
                if parent(i) == parent(j) then
                    merge topic i into topic j
                else
                    add topic node i to tree q
                end
            end
        end
    end
end

Algorithm 1: The parallel inference algorithm
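
The tree merge stage of Algorithm 1 can be illustrated
with a small recursive sketch. It is a simplification
under stated assumptions: each topic node stores raw word
counts instead of a normalized distribution, candidates
are restricted to children of the already matched parent,
and a node is added to the base tree when its best match
has similarity no greater than `min_sim`. The class
`TopicNode`, the function `merge_into` and the `min_sim`
parameter are hypothetical names introduced for the
example.

```python
import math

class TopicNode:
    def __init__(self, counts, children=None):
        self.counts = dict(counts)           # word -> count for this topic
        self.children = list(children or [])

def cosine(a, b):
    """Cosine similarity of two sparse count vectors, as in equation (10)."""
    dot = sum(v * b.get(w, 0) for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def merge_into(base, other, min_sim=0.0):
    """Fold topic node `other` into `base`, then merge children top-down."""
    for w, v in other.counts.items():
        base.counts[w] = base.counts.get(w, 0) + v
    candidates = list(base.children)         # siblings present before merging
    for child in other.children:
        best = max(candidates,
                   key=lambda b: cosine(b.counts, child.counts),
                   default=None)
        if best is not None and cosine(best.counts, child.counts) > min_sim:
            merge_into(best, child, min_sim)
        else:
            base.children.append(child)      # no comparable sibling: new node
    return base

# Toy trees from two processors.
base = TopicNode({"the": 5}, [TopicNode({"phone": 3, "app": 2}),
                              TopicNode({"game": 4})])
other = TopicNode({"the": 2}, [TopicNode({"phone": 1, "android": 2}),
                               TopicNode({"vote": 5})])
merge_into(base, other)
print([c.counts for c in base.children])
```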



Our thLDA model is different from existing topic
models. lLDA incorporates supervision by
restricting $\theta_d$ to each document's label set, so the
word-topic assignments $z_{d,n}$ are restricted to the
document's given labels. The number of unique labels $K$
in lLDA is the same as the number of topics $K$ in LDA.
Unlike lLDA, thLDA does not directly tie labels to topics
by modifying $\theta_d$, so the number of topics is not
determined by the number of labels; the labels only
serve as guidance for inferring latent topics. The
proposed thLDA model is significant in that it can
overcome the barrier that unsupervised models face
when they are applied to noisy and sparse data. By
transferring knowledge from a different domain, thLDA
also saves the time and cumbersome annotation effort
required for supervised models. thLDA has an advantage
over the LDA model in that it produces a topic hierarchy
and document clusters without additionally computing
similarity among the topic distributions of documents.
Furthermore, thLDA offers major advantages over other
supervised or semi-supervised LDA models by
providing a detailed topic hierarchy below a
certain level in an unsupervised way, while providing
the topic structure above that level guided by the
given prior knowledge. By applying prior knowledge
in both supervised and unsupervised ways, we can
apply thLDA to learn a deeper topic hierarchy
than the given depth of the source domain prior
hierarchy. In the following experiments, we show the
performance of thLDA as a combination of supervised
prior knowledge up to a certain level and unsupervised
learning below that level.



4 Experiment Results 

In our experiments, we used one source domain and
three target domain text data sets to show the
effectiveness of our transfer hierarchical LDA model.
For the target domains, we used two well-known text
collections, the Associated Press (AP) data set [TO] and
the Reuters Corpus Volume 1 (RCV1) data set [H], and one
sparse and noisy microblog data set from Twitter; for
the source domain, we used Yahoo! News categories.

Figure 2: The topic hierarchy learned from (a) hLDA and (b) thLDA as well as example tweets assigned
to the topic path.



4.1 Dataset Description 

We used a web crawler to fetch news titles in 8
categories of Yahoo! News: science, business, health,
sports, politics, world, technology and entertainment,
as shown in Fig 1(a). We parsed and stemmed the
news titles and computed tf-idf scores to generate
a weighted word vector for each topic category. We
computed each topic category score using the top 50
tf-idf weighted words in the vector and picked the
best-matching category per level as the label for each
target document.
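
A minimal sketch of the category vector construction is
given below. It assumes that titles are already tokenized
and stemmed, and it treats each category as a single
"document" when computing the inverse document frequency;
the exact tf-idf variant used in the experiments is not
specified, so this weighting and the name
`category_vectors` are assumptions.

```python
import math
from collections import Counter

def category_vectors(titles_by_category, top_k=50):
    """Build a tf-idf weighted word vector per category, keeping top_k terms."""
    # Term frequency: pool all titles of a category into one bag of words.
    tf = {c: Counter(t for title in titles for t in title)
          for c, titles in titles_by_category.items()}
    n_cat = len(tf)
    # Document frequency: in how many categories does each term occur?
    df = Counter(term for counts in tf.values() for term in counts)
    vectors = {}
    for c, counts in tf.items():
        weights = {t: f * math.log(n_cat / df[t]) for t, f in counts.items()}
        top = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
        vectors[c] = dict(top[:top_k])
    return vectors

# Toy input: tokenized, stemmed titles for two categories.
titles = {
    "sports": [["team", "win", "game"], ["game", "score"]],
    "business": [["stock", "market", "win"]],
}
print(category_vectors(titles, top_k=2))
```

Terms that occur in every category, such as "win" in the
toy input, receive weight zero and therefore drop out of
the top-ranked words.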

The Text Retrieval Conference AP (TREC-AP) collection [TU]
contains Associated Press news stories from 1988 to
1990. The original data includes over 200,000 documents
with 20 categories. The sample AP data set
from [3], which is sampled from a subset of the TREC
AP corpus, contains D = 2,246 documents with a
vocabulary size of V = 10,473 unique terms. We divided
the documents into 1,796 observed documents and 450
held-out documents to measure the held-out predictive
log likelihood.

RCV1 [TT] is an archive of over 800,000 manually
categorized newswire stories provided by Reuters,
Ltd. and distributed as a set of on-line appendices to
a JMLR article. It also includes 126 categories,
associated industries and regions. For this work, a
subset of the RCV1 data set is used. The sample RCV1
data set has D = 55,606 documents with a vocabulary size
of V = 8,625 unique terms. We randomly divided it
into 44,486 observed documents and 11,120 held-out
documents for the experiments.

We crawled Twitter data for two
weeks and obtained around 58,000 user profiles and
2,000,000 tweets. Twitter users use structural
conventions such as user-to-message relations (e.g.
initial tweet authors, via, cc, by), types of message
(e.g. broadcast, conversation, or retweet messages), and
types of resources (e.g. URLs, hashtags, keywords) to
overcome the 140-character limit. To capture trending
topics, many applications analyze Twitter data,
group similar trending topics using structural
information (e.g. hashtags or URLs) and show a list of
the top N trending topics. However, according to [7],
only 5% of tweets contain a hashtag, with 41% of these
containing a URL, and only 22% of tweets include
a URL. In this work, instead of using the structural
information (e.g. hashtags or URLs) of a tweet, we used
only its words. We removed structural information such
as initial tweet authors, via, cc, by, and URLs, and
stemmed each word to transform it into its base form
using the Porter Stemmer. For the experiment, we
randomly sampled and used D = 21,364 documents with
a vocabulary size of V = 31,181 unique terms. We
randomly divided them into 16,023 observed documents and
5,341 held-out documents.



4.2 Performance Comparison on Topic Modeling



LDA and Labeled LDA. We implemented LDA and lLDA using
the standard collapsed Gibbs sampling method. To
compare the topics learned by the supervised and
unsupervised LDA models, we ran standard LDA with 9, 20
and 50 topics and lLDA with 9 topics: the 8 top-level
Yahoo! News categories plus 1 free topic. The results
show two main observations: first, multiple topics from
standard LDA mapped to popular lLDA topics (e.g.
technology, entertainment and sports); second, not all
lLDA topics were discovered by standard LDA (e.g.
politics, world, and science were not discovered).
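
For readers unfamiliar with collapsed Gibbs sampling,
the sketch below shows one common form of the sampler
for LDA. It is an illustration under stated
assumptions, not the implementation used in the
experiments: the function name, the default values of
alpha and beta, and the allowed_topics argument are
hypothetical. Restricting each document to a list of
permitted topics is one simple way to mimic the
label constraint of Labeled LDA.

```python
import random


def collapsed_gibbs_lda(docs, num_topics, vocab_size, alpha=0.1, beta=0.01,
                        iterations=100, allowed_topics=None, seed=0):
    """docs: list of lists of word ids in [0, vocab_size).
    allowed_topics: optional per-document list of permitted topic ids."""
    rng = random.Random(seed)
    if allowed_topics is None:
        allowed_topics = [list(range(num_topics)) for _ in docs]

    doc_topic = [[0] * num_topics for _ in docs]             # n(d, k)
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # n(k, w)
    topic_total = [0] * num_topics                           # n(k)
    assignments = []

    # Random initialisation within each document's permitted topics.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.choice(allowed_topics[d])
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            topics = allowed_topics[d]
            for i, w in enumerate(doc):
                # Remove the current assignment from the counts.
                k_old = assignments[d][i]
                doc_topic[d][k_old] -= 1
                topic_word[k_old][w] -= 1
                topic_total[k_old] -= 1
                # Unnormalised conditional p(z = k | all other assignments).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in topics
                ]
                k_new = rng.choices(topics, weights=weights)[0]
                assignments[d][i] = k_new
                doc_topic[d][k_new] += 1
                topic_word[k_new][w] += 1
                topic_total[k_new] += 1
    return assignments, doc_topic, topic_word


docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 2, 3]]
# Topic 2 plays the role of a free topic shared by all documents.
z, doc_topic, topic_word = collapsed_gibbs_lda(
    docs, num_topics=3, vocab_size=4, iterations=20,
    allowed_topics=[[0, 2], [1, 2], [0, 1, 2]])
print(z)
```

Calling the function without allowed_topics gives the
unsupervised sampler over all topics.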

Hierarchical LDA. We used HLDA-C with a fixed depth
tree and a stick breaking prior on the depth weights.
The topic hierarchy generated by hLDA is shown in
Fig 2(a). Being an unsupervised model, hLDA gives a
result that depends entirely on term co-occurrence in
the documents. hLDA gives a topic hierarchy that is not
easily understood by human beings, because each tweet
contains only a small number of terms, so
co-occurrence is very sparse and less relevant compared
to long documents. Nodes in the topic hierarchy capture
some clusters of words from the input documents, such
as the 2nd topic in the 3rd column in Fig 2(a), which
has key words focusing on smart phones (Android,
iPhone, and Apple), and the 5th node in the 3rd column,
which covers online multimedia resources. However, the
2nd level topic nodes are less informative, and the
relationship between child nodes in the 3rd level and
their parent nodes in the 2nd level is less
semantically meaningful. In our run, level 3 topics
that belong to the same parent usually do not relate to
each other. Ideally, hLDA should work well for
documents that are long and have dense, overlapping
word distributions. However, hLDA gives less
interpretable results on noisy and sparse data.
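
The stick breaking prior on the depth weights mentioned
above can be illustrated with a truncated construction.
The sketch assumes the GEM(m, pi) parameterisation used
in the hLDA literature, where each break is drawn from
Beta(m * pi, (1 - m) * pi); the function name and the
default values of m and pi are hypothetical and are not
the settings used in the experiments.

```python
import random


def truncated_stick_breaking(depth, m=0.5, pi=100.0, rng=None):
    """Draw weights over `depth` tree levels by breaking a unit-length stick."""
    rng = rng or random.Random()
    weights = []
    remaining = 1.0
    for _ in range(depth - 1):
        v = rng.betavariate(m * pi, (1.0 - m) * pi)
        weights.append(remaining * v)   # share taken by this level
        remaining *= 1.0 - v            # stick left for deeper levels
    weights.append(remaining)           # last level takes what is left
    return weights


print(truncated_stick_breaking(3, rng=random.Random(1)))
```

For a fixed depth of 3, the three weights sum to one
and give the proportion of a document's words drawn
from each level of its path.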



Transfer Hierarchical LDA. We implemented standard
thLDA by modifying HLDA-C. Fig 2(b) shows two important
advantages of our model. First, topic nodes are more
interpretable because source domain knowledge is
transferred to them. Second, all the child topics
reflect a strong semantic relationship with their
parent. For example, in the topic "world", the first
child "iraq war wikileak death" relates to the Iraq
war, the second child "police kill injury drug" relates
to crime, and, more interestingly, the third child
"chile rescue miner" denotes the recent event of 33
trapped miners in Chile. Note that we only impose prior
knowledge on the 2nd level, and all 3rd level topics
emerge automatically from the data set. In the
"science" topic, the 1st child topic "space Nasa moon"
is about astronomy and the 2nd child topic "BP oil gulf
spill environment" is about the recent Gulf oil spill.
Furthermore, example tweets assigned to the topic nodes
show a strong association with tweet clustering.

To quantitatively measure the performance of thLDA, we
used the predictive held-out likelihood. We divided the
corpus into an observed set and a held-out set, and
approximated the conditional probability of the
held-out set given the training set. To make a fair
comparison, we applied the same values for the
hyperparameters that are shared by all three models,
while applying a different value of the hyperparameter
η to obtain a similar number of topics for the hLDA and
thLDA models. Following [SJ, we used outer samples:
taking samples 100 iterations apart using a burn-in of
200 samples. We collected 800 samples of the latent
variables given the held-out documents and computed
their conditional probability given the outer sample
with the harmonic mean. Fig 3 illustrates the held-out
likelihood for LDA, hLDA, and thLDA on the Twitter, AP,
and RCV1 corpora. In the figure, we applied a set of
fixed topic cardinalities to LDA and a fixed hierarchy
depth of 3 to hLDA and thLDA. We see that thLDA always
provides better predictive performance than LDA or hLDA
in all three cases (Fig 3). Interestingly, thLDA
provides significantly better performance on the (a)
Twitter data set, while hLDA performs worse than LDA.
As Blei et al. [3] [2] showed, with large numbers of
topics LDA will eventually dominate hLDA and thLDA in
predictive performance; however, thLDA performs better
with reasonable numbers of topics.

Figure 3: The held-out predictive log likelihood
comparison for LDA, hLDA, and thLDA models on three
different data sets.
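
The harmonic mean step in the evaluation above can be
written compactly in log space. The sketch below is a
generic harmonic mean estimator, assuming the input is
a list of held-out log likelihoods, one per collected
sample; the function name is hypothetical, and the code
is not taken from the authors' evaluation scripts.

```python
import math


def log_harmonic_mean(log_likelihoods):
    """Return log(HM) where HM = n / sum_s exp(-l_s) for log likelihoods l_s."""
    n = len(log_likelihoods)
    negated = [-value for value in log_likelihoods]
    # Log-sum-exp keeps the computation stable for very negative log likelihoods.
    largest = max(negated)
    log_sum = largest + math.log(sum(math.exp(v - largest) for v in negated))
    return math.log(n) - log_sum


samples = [-1000.0, -1002.0, -998.5, -1001.0]
print(log_harmonic_mean(samples))
```

Working in log space matters here because the raw
likelihoods of held-out documents are far too small to
represent directly as floating point numbers.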

For manual evaluation of tweet assignments to learned
topics, we randomly selected 100 tweets and manually
annotated the correctness of their assignments. For
LDA, we picked the topic with the highest assignment,
and for hLDA, thLDA, and parallel-thLDA, we used the
third level topic node. Their accuracies were 41%, 46%,
71%, and 56%, respectively. Table 1 shows example
tweets with their assigned topics from LDA, hLDA, and
thLDA.

4.3 Performance Comparison on Scalability

We evaluate our approximate parallel learning method on
the Twitter dataset with 5000 tweets. The log
likelihood of the training data during the Gibbs
sampling iterations is shown in Fig 4(a). In all cases,
the Gibbs sampling converges to a distribution that has
a similar log likelihood. Fig 4(b) shows the speedup
from the parallel inference method. In addition to the
overhead of the topic tree merging stage, the system
also suffers from the overhead of state loading and
saving time, which behaves similarly since both occur
every time we need to update the global tree. Because
of these overheads, the system performs better when the
merging interval is large. When we merge topic trees
every 50 Gibbs iterations, the run with 4 processes is
2.36 times faster than with 1 process, but when we
merge topic trees every 200 Gibbs iterations, the
speedup is 3.25 times. The overhead can be roughly seen
if we extend the lines in Fig 4(b) to intersect with
the y-axis. Since the merging stage complexity is
linear in the number of topic trees being merged, we
see a greater overhead in the 4 process experiment.
However, the merging algorithm complexity does not
depend on the size of the dataset, which means the
merging overhead will be negligible when run with huge
datasets.
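
The trade-off described above can be made concrete with
a toy cost model. This is only an illustration of the
reasoning, not a fit to the measurements: it assumes
that sampling time divides evenly across processes,
that each merge costs a fixed amount plus an amount
proportional to the number of trees merged, and that
the single process baseline pays no merge cost. All
parameter names and values are hypothetical.

```python
def modeled_speedup(processes, merge_interval, overhead_per_tree,
                    overhead_fixed=0.0, cost_per_iteration=1.0):
    """Speedup over one process for a single merge cycle under the toy model."""
    serial_time = cost_per_iteration * merge_interval
    merge_time = overhead_fixed + overhead_per_tree * processes
    parallel_time = serial_time / processes + merge_time
    return serial_time / parallel_time


# Hypothetical overhead of 2 iteration-equivalents per merged tree.
for interval in (50, 100, 150, 200):
    print(interval, round(modeled_speedup(4, interval, overhead_per_tree=2.0), 2))
```

Under these assumptions the speedup grows with the
merging interval and approaches the number of processes
as the per-cycle sampling time dominates the merge
cost, which matches the qualitative trend reported
above.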



[Figure 4 appears here. Panel (a), "Parallel thLDA
Inference - Accuracy": log likelihood against the
number of iterations (100 to 1000) for 1, 2, 4, and 8
processes. Panel (b), "Parallel thLDA Inference -
Speedup": plotted against the number of iterations per
step (50 to 200) for 1, 2, and 4 processes.]

Figure 4: (a) Parallel thLDA approximation inference
performance comparison. (b) Parallel thLDA
approximation inference speedup comparison.



Tweet: usnoaagov: NOAA scientists monitor ozone levels
       above Antarctic:
  LDA:   servic switch roll broadband resid voip ip
         chang asterisk
  hLDA:  tcot teaparti palin tea sgp parti gop democrat
         obama nvsen
  thLDA: climat discov check earth water chang sea
         found scientist arctic warm

Tweet: FB RT: Breaking News: NATO official: Osama bin
       Laden is hiding in northwest Pakistan. -
  LDA:   call edit via limit duti topic live playstat
         novel cafe
  hLDA:  one game yanke know new video find ranger make
         man
  thLDA: minist foreign pakistan israel un us gaza
         secur aid call

Table 1: Example tweets and their assigned topics with
top 10 words from the LDA, hLDA, and thLDA models.



5 Conclusion

In this paper, we proposed a transfer learning approach
for effective and efficient topic modeling analysis on
social media data. More specifically, we developed the
transfer hierarchical LDA model, an extension of the
hierarchical LDA model, which infers the topic
distributions of documents while incorporating
knowledge from other domains. In addition, we designed
a parallel inference framework that runs parallel Gibbs
samplers synchronously on multi-core machines so as to
perform topic modeling on large-scale datasets. Our
work is significant in that it is among the frontier
approaches exploring knowledge transfer from other
domains to topic modeling for large-scale microblog
analysis. For future work, we are interested in
exploring other effective approaches to transfer domain
knowledge in addition to topic priors as examined in
the current paper.